Storytelling and narrative are fundamental to human experience, intertwined with our social and cultural engagement. As such, researchers have long attempted to create systems that can generate stories automatically. In recent years, powered by deep learning and massive data resources, automatic story generation has shown significant advances. However, considerable challenges, such as the need for global coherence in generated stories, still prevent generative models from reaching the storytelling ability of human narrators. To tackle these challenges, many studies seek to inject structured knowledge into the generation process, which is referred to as structured knowledge-enhanced story generation. Incorporating external knowledge can enhance the logical coherence among story events, achieve better knowledge grounding, and alleviate over-generalization and repetition problems in stories. This survey provides an up-to-date and comprehensive review of this research field: (i) we present a systematic taxonomy of how existing methods integrate structured knowledge into story generation; (ii) we summarize the story corpora, structured knowledge datasets, and evaluation metrics involved; (iii) we give multidimensional insights into the challenges of knowledge-enhanced story generation and shed light on promising directions for future study.
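The attention mechanisms of transformers effectively extract pertinent information from the input sequence. However, the quadratic complexity of self-attention w.r.t. the sequence length incurs heavy computational and memory burdens, especially for tasks with long sequences. Existing accelerators face performance degradation in these tasks. To this end, we propose SALO to enable hybrid sparse attention mechanisms for long sequences. SALO contains a data scheduler to map hybrid sparse attention patterns onto hardware and a spatial accelerator to perform the efficient attention computation. We show that SALO achieves 17.66x and 89.33x speedup on average compared to GPU and CPU implementations, respectively, on typical workloads, i.e., Longformer and ViL.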
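To make the term "hybrid sparse attention pattern" concrete, the following is a minimal sketch of a Longformer-style mask that combines a sliding window with a few global tokens. It is an illustration only: the function names, the window size, and the choice of global tokens are assumptions made here, and the code does not reproduce SALO's data scheduler or hardware mapping.

```python
def hybrid_sparse_mask(seq_len, window, global_tokens):
    """Build a seq_len x seq_len boolean mask (hypothetical helper).

    mask[i][j] is True when query position i may attend to key position j.
    Two patterns are combined:
      - sliding window: |i - j| <= window
      - global tokens: listed positions attend to, and are attended by, all
    """
    global_set = set(global_tokens)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            in_window = abs(i - j) <= window
            is_global = i in global_set or j in global_set
            mask[i][j] = in_window or is_global
    return mask


def count_active(mask):
    """Number of query/key pairs that are actually computed."""
    return sum(sum(1 for allowed in row if allowed) for row in mask)


# Illustrative sizes only: 8 tokens, window of 1, token 0 is global.
mask = hybrid_sparse_mask(seq_len=8, window=1, global_tokens=[0])
active = count_active(mask)
dense = len(mask) * len(mask)
```

With a fixed window and a fixed number of global tokens, the number of active pairs grows linearly with the sequence length rather than quadratically, which is the property that sparse attention exploits.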
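Panoptic image segmentation is the computer vision task of finding groups of pixels in an image and assigning semantic classes and object instance identifiers to them. Research in image segmentation has become increasingly popular due to its critical applications in robotics and autonomous driving. The research community therefore relies on publicly available benchmark datasets to advance the state of the art in computer vision. Due to the high cost of densely labeling images, however, there is a shortage of publicly available ground truth labels that are suitable for panoptic segmentation. The high labeling costs also make it challenging to extend existing datasets to the video domain and to multi-camera setups. We therefore present the Waymo Open Dataset: Panoramic Video Panoptic Segmentation Dataset, a large-scale dataset that offers high-quality panoptic segmentation labels for autonomous driving. We generate our dataset using the publicly available Waymo Open Dataset, leveraging its diverse set of camera images. Our labels are consistent over time for video processing and consistent across multiple cameras mounted on the vehicles for full panoramic scene understanding. Specifically, we offer labels for 28 semantic categories and 2,860 temporal sequences that were captured by five cameras mounted on autonomous vehicles driving in three different geographical locations, leading to a total of 100k labeled camera images. To the best of our knowledge, this makes our dataset an order of magnitude larger than existing datasets that offer video panoptic segmentation labels. We further propose a new benchmark for Panoramic Video Panoptic Segmentation and establish a number of strong baselines based on the DeepLab family of models. We will make the benchmark and the code publicly available. Find the dataset at https://waymo.com/open.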
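Image pre-training, the current de facto paradigm for a wide range of visual tasks, is generally less favored in the field of video recognition. By contrast, a common strategy is to train spatiotemporal convolutional neural networks (CNNs) directly from scratch. Nonetheless, interestingly, by taking a closer look at these CNNs learned from scratch, we note that certain 3D kernels exhibit much stronger appearance modeling ability than others, arguably suggesting that appearance information is already well disentangled in learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features, and in revisiting image pre-training as the appearance prior for initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits the feature channels into spatial and temporal groups, to enable a more thorough decomposition of spatiotemporal features when fine-tuning 3D CNNs. Our experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs without increasing parameters or computation on both Kinetics-400 and Something-Something V2. Moreover, this new training pipeline consistently achieves better results on video recognition with a significant speedup. For instance, we achieve +0.6% top-1 accuracy for SlowFast on Kinetics-400 over the strong 256-epoch, 128-GPU baseline while fine-tuning for only 50 epochs with 40 GPUs. The code and models are available at https://github.com/ucsc-vlaa/image-pretraining-for-video.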
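Transformers have emerged as a powerful tool for visual recognition. In addition to demonstrating competitive performance on a broad range of visual benchmarks, recent works also argue that Transformers are much more robust than Convolutional Neural Networks (CNNs). Nonetheless, surprisingly, we find that these conclusions are drawn from unfair experimental settings, where Transformers and CNNs are compared at different scales and trained with distinct training frameworks. In this paper, we aim to provide the first fair and in-depth comparison between Transformers and CNNs, focusing on robustness evaluations. With our unified training setup, we first challenge the previous belief that Transformers outshine CNNs when measuring adversarial robustness. More surprisingly, we find that CNNs can easily be as robust as Transformers in defending against adversarial attacks if they properly adopt Transformers' training recipes. Regarding generalization on out-of-distribution samples, we show that pre-training on (external) large-scale datasets is not a fundamental requirement for enabling Transformers to achieve better performance than CNNs. Moreover, our ablations suggest that this stronger generalization largely benefits from the Transformer's self-attention-like architecture per se, rather than from other training setups. We hope this work can help the community better understand and benchmark the robustness of Transformers and CNNs. The code and models are publicly available at https://github.com/ytongbai/vits-vs-cnns.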
Accurate reporting of energy and carbon usage is essential for understanding the potential climate impacts of machine learning research. We introduce a framework that makes this easier by providing a simple interface for tracking real-time energy consumption and carbon emissions, as well as generating standardized online appendices. Utilizing this framework, we create a leaderboard for energy-efficient reinforcement learning algorithms to incentivize responsible research in this area, as an example for other areas of machine learning. Finally, based on case studies using our framework, we propose strategies for mitigating carbon emissions and reducing energy consumption. By making accounting easier, we hope to further the sustainable development of machine learning experiments and spur more research into energy-efficient algorithms.
Unsupervised domain adaptation (UDA) for semantic segmentation is a promising task freeing people from heavy annotation work. However, domain discrepancies in low-level image statistics and high-level contexts compromise the segmentation performance over the target domain. A key idea to tackle this problem is to perform both image-level and feature-level adaptation jointly. Unfortunately, there is a lack of such unified approaches for UDA tasks in the existing literature. This paper proposes a novel UDA pipeline for semantic segmentation that unifies image-level and feature-level adaptation. Concretely, for image-level domain shifts, we propose a global photometric alignment module and a global texture alignment module that align images in the source and target domains in terms of image-level properties. For feature-level domain shifts, we perform global manifold alignment by projecting pixel features from both domains onto the feature manifold of the source domain; and we further regularize category centers in the source domain through a category-oriented triplet loss and perform target domain consistency regularization over augmented target domain images. Experimental results demonstrate that our pipeline significantly outperforms previous methods. In the commonly tested GTA5$\rightarrow$Cityscapes task, our proposed method using Deeplab V3+ as the backbone surpasses previous SOTA by 8%, achieving 58.2% in mIoU.
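For readers unfamiliar with the mIoU figure quoted above, the following is a minimal sketch of how mean intersection-over-union is commonly computed from per-pixel class labels. The function name and the toy labels are assumptions made for illustration; benchmark implementations typically accumulate intersections and unions over the whole evaluation set before averaging, and this sketch is not the paper's evaluation code.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over flat lists of class labels.

    For each class c, IoU = |pred == c AND target == c| / |pred == c OR target == c|.
    Classes absent from both prediction and target are skipped.
    """
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes (values are illustrative only).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.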
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical in improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e., blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e., flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with a low computational cost and higher consistency with human visual perception. For temporal artifacts, we improve the self-attention-based TimeSFormer to detect them. Based on the six types of PEAs, we propose a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM). Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.
Image virtual try-on aims to replace the clothing in a person image with a garment image (in-shop clothes), and it has attracted increasing attention from the multimedia and computer vision communities. Prior methods successfully preserve the character of clothing images; however, occlusion remains a pernicious problem for realistic virtual try-on. In this work, we first present a comprehensive analysis of the occlusions and categorize them into two types: i) Inherent-Occlusion: the ghost of the former cloth still exists in the try-on image; ii) Acquired-Occlusion: the target cloth is warped onto the wrong body part. Based on this in-depth analysis, we find that the occlusions can be simulated by a novel semantically-guided mixup module, which generates semantic-specific occluded images that work together with the try-on images to facilitate training a de-occlusion try-on (DOC-VTON) framework. Specifically, DOC-VTON first conducts a sharpened semantic parsing on the try-on person. Aided by semantic guidance and a pose prior, textures of varying complexity are selectively blended with human parts in a copy-and-paste manner. Then, the Generative Module (GM) synthesizes the final try-on image while jointly learning to de-occlude. Compared with state-of-the-art methods, DOC-VTON achieves better perceptual quality by reducing occlusion effects.
Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works use separate approaches to handle thing, stuff, and part predictions without shared computation or task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework, named Panoptic-PartFormer. Moreover, we find that the previous metric, PartPQ, is biased toward PQ. To handle both issues, we make the following contributions. Firstly, we design a meta-architecture that decouples part features from things/stuff features. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model Panoptic-PartFormer. Secondly, we propose a new metric, Part-Whole Quality (PWQ), to better measure this task from both pixel-region and part-whole perspectives. It can also decouple the error for part segmentation and panoptic segmentation. Thirdly, inspired by Mask2Former and based on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole interaction scheme using masked cross attention to further boost part segmentation quality. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with the previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results with a significant cost reduction of 70% in GFLOPs and 50% in parameters. Our models can serve as a strong baseline and aid future research in PPS. Code will be available.
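As background for the metrics discussed above, the following is a minimal sketch of the standard Panoptic Quality (PQ) score for a single class, on which PartPQ builds. It assumes the matching between predicted and ground-truth segments has already been done (in the usual definition, a pair matches when its IoU exceeds 0.5); the function name and the numbers are illustrative assumptions, and the sketch covers neither PartPQ's part-level IoU nor the PWQ metric proposed in the paper.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Panoptic Quality for one class (hypothetical helper).

    PQ = sum(IoU over matched pairs) / (TP + 0.5 * FP + 0.5 * FN),
    where TP is the number of matched prediction/ground-truth segment pairs.
    """
    true_positives = len(matched_ious)
    denominator = (
        true_positives + 0.5 * num_false_positives + 0.5 * num_false_negatives
    )
    if denominator == 0:
        return 0.0  # class absent from both prediction and ground truth
    return sum(matched_ious) / denominator


# Illustrative values: two matched segments, one spurious, one missed.
pq = panoptic_quality([0.8, 0.6], num_false_positives=1, num_false_negatives=1)
```

Here the summed IoU of 1.4 is divided by 2 + 0.5 + 0.5 = 3, so unmatched segments lower the score even when the matched ones overlap well.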